
fast-interp: legacy exception handling (try/catch/rethrow/delegate/tag)#4949

Open
matthargett wants to merge 16 commits into
bytecodealliance:mainfrom
rebeckerspecialties:feat/legacy-eh-fast-interp-full

@matthargett matthargett commented May 21, 2026

Lifts the cmake guard at build-scripts/unsupported_combination.cmake:67 that forbids
WAMR_BUILD_EXCE_HANDLING=1 WAMR_BUILD_FAST_INTERP=1 and adds the matching dispatch
loop coverage in wasm_interp_fast.c — loader-side EH metadata table, runtime
EH-frame stack, catch-walk for throw / rethrow, delegate forward-to-outer,
tag-with-params payload routing, and result-typed try-region COPY-at-CATCH
alignment. I tried to keep each step bisectable.

Why we built this: we're replacing WasmEdge with WAMR fast-interp as the wasm
runtime in a pure-interpreter App-Store-eligible app, and a migration
blocker is graphql-validation compiled by Porffor — JS-to-wasm output that lowers
try/catch/throw to the wasm-exceptions section. Without EH enabled, fast-interp
rejects the binary at load with invalid section id; with EXCE_HANDLING +
CLASSIC_INTERP it loads, but fast-interp is 1.3–1.8× faster than classic-interp on
every benchmark we ran.

Cross-microarch benchmarks (M4 Lion P / M4 Sawtooth E / A14 Icestorm (iPhone 12) / A12 Tempest (iPhone XS) /
S8 (Watch SE2)) are published at:
https://github.com/rebeckerspecialties/wasm-benchmark/blob/claude/relaxed-simd-diff-fuzz/README.md#cross-runtime-results-across-apple-silicon-e-cores
Integration tests in our benchmark repo include a Porffor-compiled
graphql-validation workload that mirrors the real-world try { visit(…) } catch (e) { if (e !== abortObj) throw e; } shape and exercises every EH opcode the loader
emits. ASan + UBSan builds are part of the local dev loop.

Companion PR: relaxed-SIMD fast-interp opcode lowering, posted separately
(f32x4.relaxed_madd etc).

Validated that existing benchmarks perform nearly identically in wall-clock time, throughput, cache behavior, and branch prediction, using the CPU bottleneck template in xctrace.

…terp

Enables WAMR_BUILD_EXCE_HANDLING=1 together with FAST_INTERP=1 for the
*throw-only* subset of the legacy wasm-eh proposal — modules that
declare tags and execute `throw`/`rethrow` but never define a same-
function `try`/`catch` handler. The throw escapes via the existing
`got_exception` bailout path, exactly like any other trap, and the
host sees the exception via `wasm_runtime_get_exception`.

This is the shape produced in the wild by Porffor (the JS-to-wasm
compiler used by Fastly's StarlingMonkey): the Porffor-compiled
graphql-validation benchmark we measure cross-runtime contains 561
`throw` opcodes and zero in-wasm try/catch handlers. Every JS throw
escapes to the host JS engine, which is the typical Porffor /
static-JS-to-wasm pattern.

Three changes:

  * `build-scripts/unsupported_combination.cmake` — lift the
    EXCE_HANDLING + FAST_INTERP ban (with a comment explaining the
    scope: throw-only is supported, in-function try/catch is the
    natural follow-up).

  * `core/iwasm/interpreter/wasm_loader.c` — when fast-interp parses
    WASM_OP_THROW, emit the tag index as a uint32 immediate after
    the auto-emitted THROW opcode. Same shape as how WASM_OP_CALL
    emits its funcidx.

  * `core/iwasm/interpreter/wasm_interp_fast.c` —
    `HANDLE_OP(WASM_OP_THROW)` now reads the uint32 immediate,
    surfaces a tag-bearing exception via `wasm_set_exception`, and
    falls through to
    `got_exception`. The other legacy-EH ops (TRY / CATCH /
    CATCH_ALL / RETHROW / DELEGATE / EXT_OP_TRY) keep the existing
    "unsupported opcode" diagnostic — they're unreachable for
    fast-interp-compiled code today (the loader's fast-interp path
    treats TRY as a plain block via skip_label and never emits
    CATCH-family opcodes into the IR), so the diagnostic only fires
    if a future loader change starts emitting them.

Validated end-to-end on aarch64-apple-darwin: a benchmark-core harness
loads Porffor's graphql-validation-porf.wasm, runs `m()` (the export
that drives the validation pipeline), and gets `result=0` — matching
the cross-runtime consensus from wasmtime / WasmEdge interpreter.
Before this PR the same workload failed at LOAD with "invalid section
id" (the tag section couldn't be parsed without EXCE_HANDLING=1).

Full same-function try/catch lowering — porting the classic
interpreter's `find_a_catch_handler` design to fast-interp's slot-
allocator + pre-decoded IR — is the natural follow-up.

Adds per-function `WASMFastEHEntry[]` (sized by the existing
`func->exception_handler_count` field, allocated in pass 2 of the
preprocess pass and freed in `wasm_loader_unload`) recording each
try-region's catch handler pcs in the rewritten fast-interp IR.
This is the data the upcoming runtime EH-frame stack will consult
when a `throw` walks for a matching catch handler — it is *not yet
used* in this commit.

Three pieces of plumbing on the loader side:

  * `WASMFastEHCatch` / `WASMFastEHEntry` typedefs in `wasm.h`,
    plus a `WASMFunction.exception_handlers` field. The struct is
    gated on `WASM_ENABLE_EXCE_HANDLING && WASM_ENABLE_FAST_INTERP`
    so classic-interp builds are byte-identical.

  * `BranchBlock.eh_entry_idx` (loader-internal CSP slot) and
    `WASMLoaderContext.cur_eh_entry_idx` (the source-order cursor).
    These let CATCH / CATCH_ALL / DELEGATE / END handlers resolve
    back to the right try-region without walking the CSP at
    runtime — same pattern the existing fast-interp loader uses to
    pre-patch BR / BR_IF / BR_TABLE targets.

  * Pass-2-only populate logic on the existing CATCH, CATCH_ALL,
    DELEGATE, and END cases. The pass-1 increment of
    `exception_handler_count` is now gated on
    `loader_ctx->p_code_compiled == NULL` so it doesn't
    double-count when the loader re-scans for the second traversal.

Runtime behavior is unchanged in this commit: CATCH / CATCH_ALL /
RETHROW / DELEGATE still hit the "unsupported opcode" stub from
the throw-only patch. The dispatch wiring lands in the next
commit; this one establishes the data layout reviewers will
sanity-check first.
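
To make the data layout easier to picture, here is a minimal sketch of a per-function handler table. It is not the PR's code: the `Sketch*` type names, the helper functions, and the `tag_index` field are hypothetical, while `catch_count`, `catches`, `handler_pc`, `catch_all_pc`, and `end_of_region_pc` follow names used elsewhere in these commit messages. The real structs live in `wasm.h` and may differ.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mirror of one typed catch clause. */
typedef struct SketchEHCatch {
    uint32_t tag_index;        /* tag this clause matches (assumed field) */
    const uint8_t *handler_pc; /* first byte of the catch body in the IR */
} SketchEHCatch;

/* Hypothetical mirror of one try-region's metadata. */
typedef struct SketchEHEntry {
    uint32_t catch_count;
    SketchEHCatch *catches;          /* catch_count typed clauses */
    const uint8_t *catch_all_pc;     /* NULL when there is no catch_all */
    const uint8_t *end_of_region_pc; /* normal-flow exit target */
} SketchEHEntry;

/* Allocate `count` zeroed entries; returns NULL when count is 0. */
SketchEHEntry *sketch_eh_table_new(uint32_t count)
{
    if (count == 0)
        return NULL;
    return calloc(count, sizeof(SketchEHEntry));
}

/* Free the table and every per-entry catch array. */
void sketch_eh_table_free(SketchEHEntry *table, uint32_t count)
{
    if (!table)
        return;
    for (uint32_t i = 0; i < count; i++)
        free(table[i].catches);
    free(table);
}

/* Append one typed catch clause; returns 0 on success, -1 on OOM. */
int sketch_eh_add_catch(SketchEHEntry *e, uint32_t tag, const uint8_t *pc)
{
    SketchEHCatch *grown =
        realloc(e->catches, (e->catch_count + 1) * sizeof(*grown));
    if (!grown)
        return -1;
    grown[e->catch_count].tag_index = tag;
    grown[e->catch_count].handler_pc = pc;
    e->catches = grown;
    e->catch_count++;
    return 0;
}
```

Functions with no try-regions get no table at all, which matches the zero-cost posture the commit describes for EH-free code.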

Cost-model note: no changes to any hot-op handler (CALL, LOAD,
STORE) and the new struct fields are entirely behind the existing
WASM_ENABLE_EXCE_HANDLING guard, matching classic-interp's posture
where EH-on builds carry one byte store per PUSH_CSP and a small
per-frame allocation but leave hot ops untaxed.

Wires up the per-frame eh-stack that commit 1 laid the metadata for.
A program can now enter and exit a try-region without aborting; same-
function throw → catch dispatch still bails out via got_exception
(follow-up commit hooks that up).

Frame layout: one extra cell per try-region appended past the value
stack in the existing frame->operand[] allocation, sized by
cur_wasm_func->exception_handler_count. Functions without try blocks
pay zero cells. WASMInterpFrame gains a `uint32 eh_count` (the eh-
stack top), clustered next to the existing EH-gated
exception_raised/tag_index fields — same cache line, cold path only.

Hot-op invariants preserved:
  * No new instructions in HANDLE_OP(WASM_OP_CALL),
    HANDLE_OP(WASM_OP_*_LOAD_*), HANDLE_OP(WASM_OP_*_STORE_*).
  * Dispatch table size is unchanged (slots 0x06 = WASM_OP_TRY, 0x07 =
    WASM_OP_CATCH, 0x0b = WASM_OP_END, 0x19 = WASM_OP_CATCH_ALL just
    get new bodies — they previously fell through to the
    "unsupported opcode" stub).
  * eh_count writes/reads only happen on TRY/CATCH/CATCH_ALL/END,
    none of which are on the dispatch loop's hot path.

Loader changes (wasm_loader.c):
  * WASM_OP_TRY no longer skip_labels; emits its `eh_idx:u32`
    immediate after the auto-emitted opcode byte so the runtime push
    handler can find the right exception_handlers[] entry.
  * WASM_OP_CATCH / CATCH_ALL emit the same `eh_idx:u32` immediate;
    the runtime handler reads it to find end_of_region_pc to branch
    to on normal-flow exit.
  * WASM_OP_END for try-regions keeps the END byte in the IR (with
    the patch-list rewind dance to make `br N`-targeted PATCH_END
    addresses land *on* the END byte so the pop runs for branches
    too, not just fall-through).

Runtime handlers (wasm_interp_fast.c):
  * HANDLE_OP(WASM_OP_TRY) pushes eh_idx onto frame_lp[eh_offset +
    eh_count] and increments eh_count.
  * HANDLE_OP(WASM_OP_CATCH) and HANDLE_OP(WASM_OP_CATCH_ALL) share a
    body: decrement eh_count, set frame_ip to
    func->exception_handlers[eh_idx].end_of_region_pc.
  * HANDLE_OP(WASM_OP_END) moves out of the "unsupported opcode"
    block when EXCE_HANDLING is enabled; decrements eh_count.
  * WASM_OP_RETHROW / WASM_OP_DELEGATE / EXT_OP_TRY still route to
    the diagnostic — wired up in a follow-up commit.
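
The push/pop discipline above can be sketched in isolation. This is a simplified model, not the interpreter code: `SketchFrame`, its fixed 8-cell array, and the explicit bounds checks are assumptions made for illustration (the commit messages describe the real code storing cells in `frame_lp` past the value stack).

```c
#include <stdint.h>

/* Hypothetical stand-in for the EH-related part of an interpreter frame. */
typedef struct SketchFrame {
    uint32_t eh_count;    /* eh-stack top */
    uint32_t eh_capacity; /* stands in for exception_handler_count (<= 8) */
    uint32_t eh_cells[8]; /* stands in for the cells past the value stack */
} SketchFrame;

/* TRY: push the region's eh_idx.
 * Returns 0 on success, -1 if the static reservation is already full. */
int sketch_try_enter(SketchFrame *f, uint32_t eh_idx)
{
    if (f->eh_count >= f->eh_capacity)
        return -1;
    f->eh_cells[f->eh_count++] = eh_idx;
    return 0;
}

/* Normal-flow exit of a region (CATCH / CATCH_ALL reached by
 * fall-through, or END): pop one entry.
 * Returns 0 on success, -1 if the eh-stack is already empty. */
int sketch_region_exit(SketchFrame *f)
{
    if (f->eh_count == 0)
        return -1;
    f->eh_count--;
    return 0;
}
```

Returning an error instead of writing out of bounds is a choice made for this sketch only; it keeps the example safe to run when pushes and pops are unbalanced.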

After this commit: programs with try-regions where no throw fires
inside the try body run correctly (the eh-stack is correctly
maintained through entry/exit). Throws inside try bodies still
escape via got_exception, matching the throw-only patch's behavior.
porf-accurate still errors at the first throw escape (its catch
handler does real work; full catch dispatch is the next commit).

Activates same-function and inter-function catch dispatch for the
*void-result* try-region shape (which is what graphql-validation-
porf-accurate emits — `06 40` = try-with-blocktype-void). Programs
that throw inside a void try body now land in the matching catch
handler (or catch_all) instead of escaping to the host trap path.
The eh-stack push/pop infrastructure from the prior commit gives us
the in-scope handlers; this commit adds the walk and the cross-frame
unwind.

Hot-op cost-model check:
  * HANDLE_OP(WASM_OP_THROW) is itself a cold op — programs that
    never throw never enter it. The walk runs in
    find_a_catch_handler, also cold.
  * The one new check on a path every wasm-to-wasm call return
    visits is the `if (frame->exception_raised)` branch in
    return_func. Predicted strongly not-taken (exceptions are
    rare); two AArch64 instructions; identical in shape to
    classic-interp's existing check at wasm_interp_classic.c:6877.
  * The eh-stack cells share the cache line with the value stack
    they're allocated next to, so the walk hits warm memory.
  * CALL / LOAD / STORE handlers are byte-identical to the no-EH
    path.

Mechanism:
  * `find_a_catch_handler` is a labeled block reached either by
    WASM_OP_THROW or by return_func when a callee stashed a tag
    on this frame. It walks frame->eh_count entries top-down,
    skipping entries whose top bit is set (state CATCH — already
    in an active handler; throws raised inside skip outward).
    On a tag match it ORs in EH_TRY_CATCH_STATE_BIT and dispatches
    frame_ip to entry->catches[j].handler_pc (or
    entry->catch_all_pc when no typed clause matches).
  * On exhaustion, the walker stashes exception_tag_index on
    prev_frame->tag_index, sets prev_frame->exception_raised = true,
    and goes to return_func. return_func, after RECOVER_CONTEXT
    has restored the caller's context, re-enters
    find_a_catch_handler with the caller's frame in scope.
  * At the top of the wasm stack (prev_frame->ip == NULL) the
    walker takes the existing got_exception escape so the host
    can read the trap message via wasm_runtime_get_exception.
  * frame->exception_raised and frame->tag_index are pre-existing
    fields originally added for classic-interp. exception_raised
    must now be cleared on every fast-interp frame setup, because
    ALLOC_FRAME doesn't zero-init the header and a stale non-zero
    byte trips the return_func check on every call return.

Loader-side bug fix: the CATCH and CATCH_ALL emit_uint32(eh_idx)
calls used to live inside the `if (loader_ctx->p_code_compiled !=
NULL)` populate guard. That gating skipped them in pass 1 but ran
them in pass 2, so pass 2 wrote 4 bytes per catch *past* the
code_compiled buffer allocated based on pass 1's measurement. The
overrun corrupted whatever loader allocation the heap placed
immediately after — typically func->exception_handlers itself (the
first 4 bytes of entry[0], i.e. catch_count, was the usual victim).
Surfaced as "wasm exception thrown (tag 0)" on `test_local_throw`
where the typed-catch's catches[] array showed count=0 at runtime
even though the loader populated count=1 in pass 2 — the populate
itself wrote correctly, then a later opcode's reserve_block_ret
overran the buffer and zeroed catch_count. Moved both emit_uint32
calls outside the populate guard so both passes account for the
4-byte immediate.
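
The failure mode is easier to see in a toy two-pass emitter. The sketch below is an assumption-laden model, not the WAMR loader: `SketchEmitter` and all function names are hypothetical, and unlike the real code it bounds-checks pass 2 so the overrun is reported instead of corrupting memory. Pass 1 runs with `buf == NULL` and only measures; pass 2 writes into a buffer sized by pass 1.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct SketchEmitter {
    uint8_t *buf; /* NULL during the measuring pass */
    size_t size;  /* capacity in pass 2 (measured by pass 1) */
    size_t pos;   /* bytes emitted so far in the current pass */
} SketchEmitter;

/* Emit a u32 immediate. Pass 1 only advances the counter.
 * Returns -1 if pass 2 would write past the measured size. */
int sketch_emit_u32(SketchEmitter *e, uint32_t v)
{
    if (e->buf) {
        if (e->pos + sizeof v > e->size)
            return -1;
        memcpy(e->buf + e->pos, &v, sizeof v);
    }
    e->pos += sizeof v;
    return 0;
}

/* Emit one CATCH immediate. `guarded` reproduces the bug: the emit
 * is skipped in pass 1, so pass 1 under-counts by 4 bytes per catch. */
int sketch_emit_catch(SketchEmitter *e, uint32_t eh_idx, int guarded)
{
    if (guarded && e->buf == NULL)
        return 0;
    return sketch_emit_u32(e, eh_idx);
}

/* Run both passes over n catches.
 * Returns 0 if pass 2 fit in the buffer, -1 on overrun or OOM. */
int sketch_two_pass(uint32_t n_catches, int guarded)
{
    SketchEmitter e = { NULL, 0, 0 };
    int rc = 0;

    for (uint32_t i = 0; i < n_catches; i++)
        sketch_emit_catch(&e, i, guarded);

    e.size = e.pos;
    e.buf = malloc(e.size ? e.size : 1);
    if (!e.buf)
        return -1;
    e.pos = 0;

    for (uint32_t i = 0; i < n_catches && rc == 0; i++)
        rc = sketch_emit_catch(&e, i, guarded);

    free(e.buf);
    return rc;
}
```

With `guarded == 0` both passes agree on the size; with `guarded == 1` the first pass-2 emit already exceeds the measured buffer.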

State encoding: each eh-stack cell packs the loader's
exception_handlers[] index in the low 31 bits and a state bit
(EH_TRY_CATCH_STATE_BIT) in the top bit. No cell-count change vs
the prior commit; same per-frame allocation footprint.
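
As a rough illustration of the cell encoding and the top-down walk, consider the sketch below. It simplifies heavily: each try-region catches exactly one tag (`region_tag[idx]`), there is no catch_all, and all names (`SKETCH_STATE_BIT`, `sketch_find_handler`, and so on) are hypothetical stand-ins rather than the PR's identifiers.

```c
#include <stdint.h>

/* Stands in for EH_TRY_CATCH_STATE_BIT: top bit of a 32-bit cell. */
#define SKETCH_STATE_BIT 0x80000000u

/* Low 31 bits: index into the per-function handler table. */
static inline uint32_t sketch_cell_idx(uint32_t cell)
{
    return cell & ~SKETCH_STATE_BIT;
}

/* Top bit set: the entry's catch handler is already running. */
static inline int sketch_cell_in_catch(uint32_t cell)
{
    return (cell & SKETCH_STATE_BIT) != 0;
}

/* Walk the eh-stack top-down. Entries already in CATCH state are
 * skipped. On a tag match the entry is marked as in CATCH state and
 * its stack position is returned; -1 means no handler in this frame. */
int sketch_find_handler(uint32_t *cells, uint32_t eh_count,
                        const uint32_t *region_tag, uint32_t thrown_tag)
{
    for (uint32_t i = eh_count; i > 0; i--) {
        uint32_t cell = cells[i - 1];
        if (sketch_cell_in_catch(cell))
            continue;
        if (region_tag[sketch_cell_idx(cell)] == thrown_tag) {
            cells[i - 1] = cell | SKETCH_STATE_BIT;
            return (int)(i - 1);
        }
    }
    return -1;
}
```

Packing the state into the index cell is what keeps the per-frame footprint unchanged: no second cell is needed just to record TRY versus CATCH.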

Known limitation: try-regions with a non-void result-type are not
yet supported by the *normal-flow* path. The fix is a loader-side
try-body→block-dynamic-offset COPY emit at CATCH processing time
(mirrors how WASM_OP_ELSE aligns the if-body's result via
reserve_block_ret). See the "Open follow-up — WAMR fast-interp
legacy exception handling" section of AGENTS.md.
graphql-validation-porf-accurate uses void-result try-blocks, so it
isn't blocked by this.

Verified by `crates/benchmark-core/src/bin/probe_eh_void.rs` (5
cases — typed catch, catch_all, inter-function unwind, nested,
no-throw — all PASS) and the existing run_graphql_validation_wamr
regression (AS / porf-fast / porf-accurate within run-to-run
variance vs the prior commit).

Activates the RETHROW opcode: re-raise the exception currently being
handled by the (depth+1)-th `state=CATCH` entry from the top of the
per-frame eh-stack. Source form `rethrow N` becomes `RETHROW <N:u32>`
in the rewritten IR; the runtime walker scans the eh-stack top-down,
skips state=TRY entries (they're not "catch handlers in progress"),
and on the (depth+1)-th state=CATCH match reads its stashed caught
tag and dispatches to `find_a_catch_handler` exactly as a fresh
throw with that tag would.

Storage shape: each eh-stack entry is now `EH_ENTRY_CELLS = 2` cells
wide. Cell 0 packs `eh_idx | EH_TRY_CATCH_STATE_BIT` (unchanged); cell
1 holds the wasm tag index of the exception currently being handled
on that entry (undefined while the entry is in TRY state — the throw
walker writes it on catch dispatch). Frame allocation grows by
`exception_handler_count * 2` cells per call; functions without try
blocks still pay zero cells.
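
A minimal sketch of the rethrow lookup over two-cell entries, under the layout just described (cell 0 = index plus state bit, cell 1 = stashed tag). The macro and function names are hypothetical; the real handler then jumps to `find_a_catch_handler`, which this sketch does not model.

```c
#include <stdint.h>

#define SKETCH_STATE_BIT 0x80000000u /* stands in for EH_TRY_CATCH_STATE_BIT */
#define SKETCH_ENTRY_CELLS 2u        /* stands in for EH_ENTRY_CELLS */

/* Find the tag stashed on the (depth+1)-th CATCH-state entry from the
 * top of the eh-stack. Entries still in TRY state are skipped because
 * they are not handlers in progress.
 * Returns 0 and writes *tag_out on success, -1 if no such entry. */
int sketch_rethrow_tag(const uint32_t *cells, uint32_t eh_count,
                       uint32_t depth, uint32_t *tag_out)
{
    uint32_t seen = 0;
    for (uint32_t i = eh_count; i > 0; i--) {
        const uint32_t *entry = cells + (i - 1) * SKETCH_ENTRY_CELLS;
        if (!(entry[0] & SKETCH_STATE_BIT))
            continue; /* TRY state: cell 1 is undefined, ignore it */
        if (seen++ == depth) {
            *tag_out = entry[1];
            return 0;
        }
    }
    return -1;
}
```

The walk touches at most `eh_count` entries, which is bounded by the nesting depth around the rethrow site.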

Hot-op cost-model check:
  * No new code in HANDLE_OP(WASM_OP_CALL) / LOAD_* / STORE_*.
  * RETHROW is a cold op (only fires inside catch bodies); the walk
    runs across at most the number of catches nested around the
    rethrow site.
  * TRY's push does not write cell 1 (it stays undefined until
    the throw walker overwrites it on dispatch), so the push is the
    same single indexed store as before, just with a wider stride.
  * `frame->exception_raised` init + the return_func hook are
    unchanged from the prior commit; no new branches on any
    return path.

Loader-side land-mine cleared: WAMR's shared `check_branch_block`
calls `emit_br_info` unconditionally, which for a typical
arity-zero catch target writes 4 bytes (arity) + 8 bytes (target
ptr placeholder via `add_label_patch_to_list`) into the IR between
the auto-emitted opcode label and the next op. RETHROW doesn't
*branch* to its target — it walks the eh-stack — so those br-info
bytes are dead weight, and worse: they shift our depth immediate
past where the runtime `read_uint32(frame_ip)` looks for it. The
RETHROW case in the loader now does its own depth + label-type
validation (manual `loader_ctx->frame_csp - depth - 1` lookup,
LABEL_TYPE_CATCH/CATCH_ALL check) and skips check_branch_block
entirely.

Verified by three new cases in
`crates/benchmark-core/tests/eh_correctness.rs`:
  - `rethrow_depth_zero`: inner catch sets a flag, `rethrow 0`,
    outer catch sees the same tag (= 11).
  - `rethrow_preserves_tag`: two tags ($a, $b); throw $b → inner
    catch $b → rethrow 0; outer catch $b wins over outer catch $a
    (= 11).
  - `rethrow_depth_one`: nested catches; from inside the
    innermost (which caught $b), `rethrow 1` re-raises the
    *outer* catch's tag ($a). All 23 cases in the EH correctness
    suite pass; AS / porf-fast / porf-accurate benchmark medians
    overlap the prior commit's range within run-to-run variance
    (three runs each).

Wires up the runtime + loader for `try ... delegate N` so the throw
walker can re-raise the exception at the target block's location
without spending hot-op budget.

Loader (wasm_loader.c, WASM_OP_DELEGATE case):
  Skip the shared `check_branch_block_for_delegate` — its
  `emit_br_info` call would write 12 bytes of branch metadata
  between the auto-emitted DELEGATE label and the next op, dead
  weight at runtime and (worse) the same alignment-shift gotcha
  that bit RETHROW. Do the depth read + bounds check inline.
  In pass 2, count try/catch/catch_all blocks STRICTLY between the
  delegate's frame and the target block — that count (`delta`) is
  exactly how many eh-stack entries the runtime walker must skip
  past, by spec.

Runtime (wasm_interp_fast.c):
  * find_a_catch_handler: before catch-matching, check
    `entry->delegate_target_depth`. If set, mark the delegate's
    own eh-stack entry consumed (STATE bit) and do `i -= delta;
    continue;` so the for-loop's natural i-- lands on the first
    eh-stack entry strictly outside the target block. The
    `delta + 1 >= i` guard catches "delegate to function block"
    (target lies outside this function's eh-stack) and falls
    through to the existing "no handler in this frame"
    return_func path.
  * WASM_OP_DELEGATE: split out of the "unsupported opcode" stub
    into its own normal-flow handler — fires when the try body
    completes without throwing; just `frame->eh_count--` and
    advance.

Cost shape preserved: zero new bytes in CALL / LOAD / STORE; all
delegate work lives on the cold throw walker or the cold normal-
flow exit handler.

Wires up the loader + runtime path so a tagged exception with i32 /
i64 / v128 parameters delivers its payload to the matching catch
body's operand stack — same-function dispatch only. Cross-function
dispatch (callee throws, caller catches) still drops the payload;
that gap is now surfaced explicitly via the
`cross_function_tag_with_params` integration test (#[ignore]'d
with the same justification recorded in AGENTS.md).

WASMFastEHCatch grows two fields:
  uint32 param_cell_num;
  int16 *param_dst_offsets;

The dst-slots array is a loader-owned int16[] of length
`param_cell_num`, capturing the cell-wise frame_lp slot offsets
that the catch body's downstream ops will pop from. NULL for the
common tag-without-params case (Porffor's empty-payload tags, all
of the spec-test's `tag $err` declarations) — no heap allocation
and the runtime walker's copy loop is a trivial zero-iteration
no-op.

Loader (wasm_loader.c) — CATCH case:
  * Swap `PUSH_TYPE` for `PUSH_OFFSET_TYPE` so the catch body's
    incoming params get fresh `dynamic_offset` slots allocated +
    emitted as int16 operands in the IR (right after the eh_idx
    immediate). The PUSH_OFFSET_TYPE emits are dead bytes on the
    normal-flow CATCH dispatch (which only reads eh_idx and
    branches to end_of_region_pc), but they're necessary so the
    catch body's POP_OFFSET_TYPEs find the right slot offsets in
    frame_offset[].
  * Pass 2 captures handler_pc AFTER the PUSH_OFFSET_TYPEs so the
    throw walker's `frame_ip = handler_pc` lands at the first byte
    of the catch body proper (skipping the dead dst-slot bytes).
  * Pass 2 also bh_memcpy_s's frame_offset[]'s top
    `param_cell_num` cells into a fresh int16[] on the catch's
    WASMFastEHCatch — these are the destination offsets the
    runtime walker will write payload values to.
  * Free path in wasm_loader_unload extended to free the
    per-catch dst-offsets array.

Loader — THROW case (wasm_loader.c):
  * Moved the existing `emit_uint32(tag_index)` below the
    tag-type lookup + validation so `tag_type->param_cell_num` is
    available.
  * After tag_index, emit `<param_cell_num:u32>` plus
    `<src_offset_i:int16>` for i in 0..param_cell_num. The src
    offsets are read directly off the top of `loader_ctx->
    frame_offset[]` — the validation loop above pops frame_ref
    but doesn't touch frame_offset, so they're stable. Both
    traverses run the same emit to keep pass-1 / pass-2 size
    accounting balanced.

Runtime (wasm_interp_fast.c) — new locals in the dispatch
function (cold-path only, same scope as `exception_tag_index`):
  uint32 throw_param_cell_num = 0;
  int16 *throw_src_offsets = NULL;

These get populated by HANDLE_OP(WASM_OP_THROW), which now reads
tag_index + param_cell_num + the src-offsets array off the IR
(advancing frame_ip past all three). The pair is consumed by
find_a_catch_handler's catch-match dispatch: on a typed-catch
match it does the cell-wise copy `frame_lp[dst[c]] =
frame_lp[src[c]]`. catch_all dispatch explicitly drops the
payload (per spec — catch_all binds no exception values). The
copy loop is fully cold (only THROW reaches here); CALL / LOAD
/ STORE handlers untouched.
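
The cell-wise copy `frame_lp[dst[c]] = frame_lp[src[c]]` can be shown as a standalone helper. This is a sketch under assumptions: `sketch_copy_payload` is a hypothetical name, `frame_lp` is modeled as a plain `uint32_t` array, and all offsets are taken to be non-negative.

```c
#include <stdint.h>
#include <stddef.h>

/* Copy an exception payload, one 32-bit cell at a time, from the
 * throw site's source slots to the catch's destination slots.
 * A NULL offset array (tag without params, or catch_all) or a zero
 * cell count makes this a no-op. */
void sketch_copy_payload(uint32_t *frame_lp, const int16_t *dst,
                         const int16_t *src, uint32_t cell_num)
{
    if (dst == NULL || src == NULL)
        return;
    for (uint32_t c = 0; c < cell_num; c++)
        frame_lp[dst[c]] = frame_lp[src[c]];
}
```

For a tag with a single i64 param, `cell_num` would be 2 and both offset arrays would hold two consecutive slots.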

WASM_OP_RETHROW: extended to re-point throw_src_offsets at the
matched catch's `param_dst_offsets` before the goto to
find_a_catch_handler, so rethrow from inside a typed catch carries
the same payload outward. The catch body can't mutate the dst slots
(they're allocated from `dynamic_offset`, separate from the
local-slot range that local.set writes to), so the values are
still the original ones at rethrow time. Rethrow from inside a
catch_all (whose `param_dst_offsets == NULL`) falls back to
zero-cell — documented as a known limitation.

return_func hook: the cross-frame branch zeros throw_param_cell_num
and throw_src_offsets before the goto find_a_catch_handler, since
the callee's source slots live in a frame that's about to
be torn down — same payload-dropping semantics as the existing
cross-function-no-payload case, but explicit instead of
relying on uninitialized stack.

Cost shape preserved: zero new bytes in CALL / LOAD / STORE.
EH_ENTRY_CELLS still 2; no extra cells per try-region. The two
new locals get spilled by the compiler since the hot loop
doesn't reference them.

Two bugs surfaced once same-function tag-with-params actually got
exercised by integration tests:

1. **`PUSH_OFFSET_TYPE` is offset-only.** The CATCH loader was
   bumping `dynamic_offset` + `frame_offset[]` but never
   `stack_cell_num`, leaving the operand and ref stacks out of
   sync. The catch body's first consumer (e.g. `global.set $g`)
   then hit `wasm_loader_pop_frame_offset`'s polymorphic
   short-circuit — the CATCH block inherits the polymorphic flag
   from THROW's `SET_CUR_BLOCK_STACK_POLYMORPHIC_STATE` and with
   `available_stack_cell == 0` the pop silently returned without
   emitting the source-slot operand bytes. The consumer's
   runtime read then landed on heap garbage and crashed with
   SIGBUS / SIGSEGV. Fix: pair `PUSH_OFFSET_TYPE` with `PUSH_TYPE`
   (ref-only) so both stacks advance in lockstep.

2. **Multi-cell `frame_offset[]` entries are unreliable past
   the first cell.** `wasm_loader_push_frame_offset` writes a
   meaningful int16 only for the FIRST cell of a multi-cell
   value (i64, f64, v128); the subsequent cell entries are left
   uninitialized (just a pointer increment, no write). My pass-1
   THROW src-offset emit and pass-2 CATCH dst-offset capture
   were reading those uninitialized cells directly, producing
   garbage offsets for any param wider than 32 bits.
   Fix: walk params (not cells) and synthesize consecutive cell
   offsets `(first, first+1, ..., first+N-1)` per param, where
   `first = frame_offset[cell_so_far]`. Matches the runtime
   invariant that an N-cell value occupies N consecutive
   frame_lp cells.
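
The second fix, walking params rather than cells, might look roughly like the sketch below. The function name and the `param_cells` array are hypothetical; the only behavior taken from the description above is that just the first `frame_offset[]` entry of a multi-cell value is trusted and the rest are synthesized as `first + k`.

```c
#include <stdint.h>

/* Expand per-param first-cell offsets into per-cell offsets.
 *   frame_offset : loader offset stack; only the FIRST cell of each
 *                  value holds a meaningful offset
 *   param_cells  : cells per param (e.g. 1 for i32, 2 for i64)
 *   out          : receives one offset per cell
 * Returns the total number of cells written to `out`. */
uint32_t sketch_expand_offsets(const int16_t *frame_offset,
                               const uint8_t *param_cells,
                               uint32_t param_count, int16_t *out)
{
    uint32_t cell_so_far = 0;
    for (uint32_t p = 0; p < param_count; p++) {
        int16_t first = frame_offset[cell_so_far];
        for (uint8_t k = 0; k < param_cells[p]; k++)
            out[cell_so_far + k] = (int16_t)(first + k);
        cell_so_far += param_cells[p];
    }
    return cell_so_far;
}
```

For an `(i32, i64)` payload whose first cells sit at slots 4 and 5, the result is offsets 4, 5, 6 regardless of what the uninitialized third `frame_offset[]` entry contains.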

3 new integration tests cover the fixes:
  * `tag_single_i64_param` — 2-cell payload
  * `tag_mixed_i32_i64_params` — exercises per-param cell
    synthesis (would fail if cell-walk offset by 1)
  * `repeated_throw_with_payload` — confirms catch-allocated
    dst slots get fresh writes every invocation

Plus a wat fix in `nested_try_with_params_inner_wins`: the
outer catch's body was `i32.const 999 / global.set $g`, leaving
the param on the operand stack at `end`. That was a latent bug
masked before tag-with-params support (PUSH_TYPE-only didn't
let the param "exist" for validation purposes). Now corrected
by adding an explicit `drop` so the catch body's stack
validates clean.

No hot-op cost change: all the new loader work is in the cold
CATCH / THROW preprocess paths, and the runtime walker copy
loop is unchanged.

`try (result T)` regions now route the try body's normal-flow
value into the block's `dynamic_offset` slot the same way `else`
routes the if-body's value via `reserve_block_ret`. The throw-
dispatch path's catch-body END already handled the catch's COPY
via the existing reserve_block_ret call; this patch fills the
remaining gap by injecting a COPY before each CATCH/CATCH_ALL
label so the normal-flow exit (try body completes without
throwing → falls through to CATCH → CATCH runtime handler jumps
to end_of_region_pc) also deposits the value at the right slot.

Loader (wasm_loader.c):
  * WASM_OP_CATCH and WASM_OP_CATCH_ALL: before the existing
    emit_uint32(eh_idx) emit, call `check_block_stack` on the
    previous body (the try body on the first CATCH; the prior
    catch body on subsequent ones) and emit an
    EXT_OP_COPY_STACK_TOP / _I64 / _V128 if the body's last cell
    isn't already at `cur_block->dynamic_offset`. The
    `src != dst` predicate runs in both passes; the sign-stable
    nature of dynamic_offset (≥ 0) vs const-pool slots (≤ -1)
    keeps pass-1 size accounting and pass-2 writes aligned even
    though const-pool slots get renumbered by the qsort/dedup at
    the start of pass 2.
  * Both cases now also `SET_CUR_BLOCK_STACK_POLYMORPHIC_STATE
    (false)` after `RESET_STACK()`, matching how `WASM_OP_ELSE`
    resets the if-body's polymorphic flag. Without this reset, a
    catch body following a throw inherits the polymorphic state
    and `check_block_stack` at END takes the polymorphic branch
    (`POP_OFFSET_TYPE` → 2 bytes per return-cell emitted). Those
    bytes land between the auto-emitted END label and the EH-END
    branch's `skip_label()`, shifting the re-emitted END label
    forward and leaving a corrupt handler-ptr at the recorded
    `handler_pc` — SIGSEGV on the first dispatch.

Multi-return-value try-regions get an explicit "not yet
supported" error; they need `EXT_OP_COPY_STACK_VALUES` emit
support that's not in this commit. Single-return-value covers
every shape Porffor / AS / our 51-case integration suite emits.

6 new result-typed integration tests (single i32 / i64, with
and without throw, multi-catch picked by tag, catch_all
fallback, mixed-with-locals slot allocation). Plus a wat fix in
`multiple_catches_with_params_pick_by_tag`: the `catch $a` body
left its param on the operand stack before the catch-to-catch
transition. The previous loader didn't validate catch
transitions, so this latent imbalance was silently accepted;
now `check_block_stack` runs at every CATCH, catches the
unbalanced stack, and reports the spec-required `type mismatch:
block requires [] but stack has [i32]`. Added an explicit
`drop` in the catch body so the test's wat validates clean.

Verified end-to-end: 51/51 EH integration tests pass (was 45/45
before; +6 new result-typed cases). porf-accurate runs at 15.6
ms median (no regression vs the 17.3 ms baseline; small
improvement plausibly from the polymorphic-reset path no longer
emitting redundant POP_OFFSET_TYPE operands).

Adds a load-time warning when a br / br_if / br_table opcode
crosses one or more LABEL_TYPE_TRY / _CATCH / _CATCH_ALL
frames, because the runtime br doesn't pop the eh-stack — each
crossed try-region leaks one eh-stack entry that survives until
frame teardown.

The simple case (single br out of a try; e.g. the
`br_out_of_try_pops_eh_stack` integration test) is benign: the
per-frame eh-stack reservation
(`exception_handler_count * EH_ENTRY_CELLS` cells, covering
every static try-block in the function) leaves room for one
stale entry alongside any subsequent sibling try's push, and the
top-down walker iterates from `eh_count` down so sibling-try
throws still match the most recent push first. The stale entry
dies when the frame is freed at function return.

The pathological case — `loop { try { br_to_loop_top } catch }`
— leaks one entry per iteration and eventually overflows the
static reservation.
`bh_assert(eh_count < exception_handler_count)` would catch this,
but `bh_assert` is a no-op in release builds (`BH_DEBUG` is unset
there), so the out-of-bounds writes go through silently. The
warning surfaces the shape in
load-time diagnostics so a real embedder sees it before the
hard-to-diagnose runtime corruption.

`count_try_blocks_crossed(cur_block, target_block)` walks csp
positions from cur_block down to target_block inclusive (target
included because br to a non-LOOP target lands AFTER target's
end, skipping it; LOOP targets aren't try-typed so the inclusive
vs exclusive distinction doesn't change the count). The check
fires only in pass 1 (`loader_ctx->p_code_compiled == NULL`) so
each br site logs once even though wasm_loader_prepare_bytecode
runs the bytecode twice. No hot-op cost — this is loader-time
only.
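
A simplified model of the counting walk, for illustration only. The real `count_try_blocks_crossed` takes block pointers; this sketch uses array indices into a hypothetical control-stack array (`csp[0]` outermost) and a made-up `SketchLabelType` enum.

```c
#include <stdint.h>

/* Hypothetical label kinds; the loader's LABEL_TYPE_* values differ. */
typedef enum {
    SK_LABEL_BLOCK,
    SK_LABEL_LOOP,
    SK_LABEL_IF,
    SK_LABEL_TRY,
    SK_LABEL_CATCH,
    SK_LABEL_CATCH_ALL
} SketchLabelType;

/* Count try-typed frames from `cur` down to `target`, both inclusive.
 * Requires target <= cur. A non-zero result is the number of
 * eh-stack entries a branch from `cur` to `target` would leak. */
uint32_t sketch_count_try_crossed(const SketchLabelType *csp,
                                  uint32_t cur, uint32_t target)
{
    uint32_t n = 0;
    for (uint32_t i = target; i <= cur; i++) {
        if (csp[i] == SK_LABEL_TRY || csp[i] == SK_LABEL_CATCH
            || csp[i] == SK_LABEL_CATCH_ALL)
            n++;
    }
    return n;
}
```

A loader using this would emit the warning whenever the count is non-zero, and only during the measuring pass so each branch site logs once.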

Verified: porf-accurate doesn't trigger the warning (no
br-across-try patterns in the Porffor emit shape, consistent
with the PMU profile showing zero hot-op overhead from EH).
`br_out_of_try_pops_eh_stack` integration test triggers the
warning once and still passes.
… checks

Marks the four structurally-cold paths in WASM_OP_CALL_INDIRECT —
out-of-bounds table index, uninitialized element, unknown function
(post-table lookup), indirect-call type mismatch — with
`__builtin_expect(cond, 0)`. Well-formed wasm modules pass all four
on every dispatched CALL_INDIRECT; the hint lets the compiler:

  (a) provide a static-bias fallback for the branch predictor on
      unseen call sites (first-iteration impact only — Apple
      Silicon's predictor learns the bias dynamically after a few
      hits anyway);
  (b) lay out the error-handling tail away from the hot path so
      each pass-through case stays in straight-line I-cache.
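
For readers unfamiliar with the hint, here is a small self-contained sketch of the pattern. It models only three of the four checks and uses hypothetical types (`SketchElem`, `sketch_call_indirect`); `__builtin_expect` itself is a GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <stdint.h>
#include <stddef.h>

/* Branch hint: tell the compiler the condition is expected false. */
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

typedef int32_t (*SketchFn)(int32_t);

typedef struct {
    SketchFn fn;       /* NULL models an uninitialized element */
    uint32_t type_idx; /* models the function's type index */
} SketchElem;

enum { SK_OK = 0, SK_OOB = 1, SK_UNINIT = 2, SK_TYPE_MISMATCH = 3 };

/* Indirect call with the trap checks marked cold, so the compiler can
 * place the error returns away from the straight-line call path. */
int sketch_call_indirect(const SketchElem *table, uint32_t table_size,
                         uint32_t idx, uint32_t expected_type,
                         int32_t arg, int32_t *result)
{
    if (SKETCH_UNLIKELY(idx >= table_size))
        return SK_OOB;
    if (SKETCH_UNLIKELY(table[idx].fn == NULL))
        return SK_UNINIT;
    if (SKETCH_UNLIKELY(table[idx].type_idx != expected_type))
        return SK_TYPE_MISMATCH;
    *result = table[idx].fn(arg);
    return SK_OK;
}

/* Example callee used to exercise the sketch. */
int32_t sketch_double(int32_t x)
{
    return x * 2;
}
```

The hint changes code layout and static prediction only; the observable behavior of each check is identical with or without it.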

Measured on iPhone 12 (A14, Icestorm E-cores) with the
graphql-validation workloads — bucket-share deltas are within
run-to-run noise on both Porffor and AS, but the Porffor
bottleneck is `Processing` (56.78%, backend / load-store
saturation) not branch prediction (4.19% Discarded). AS's E-core
shows the structural opportunity (27.22% Discarded) but that's
the goto-indirect-branch in FETCH_OPCODE_AND_DISPATCH, not the
direct branches inside CALL_INDIRECT.

Kept as documentation-as-code: the cold-path semantic is real
(spec-required traps that ~never fire on validated modules), and
the compiler-time cost is zero. Full PMU writeup in
out/eh-pmu-iphone12-2026-05-18.md (gitignored).

No correctness change. No hot-op runtime cost. Doesn't affect EH
code paths.

The legacy exception-handling spec test suite was previously hardcoded
to skip every running mode except classic-interp:

    if [[ "${RUNNING_MODE}" != "classic-interp" ]]; then
        echo "support exception handling in classic-interp"
        return 0
    fi

Now that fast-interp supports the full legacy-EH proposal (TRY / CATCH /
CATCH_ALL / RETHROW / DELEGATE / tag-with-params), the gate should
allow both modes. This matches the parallel `ENABLE_GC` block a few
lines down that already lists `classic-interp` AND `fast-interp` as
acceptable.

After this change, `./test_wamr.sh -t fast-interp -m exception-handling`
runs the upstream WebAssembly spec EH suite against the fast
interpreter — the same suite already validated against classic
interp.
When a throw from a nested try is caught by an OUTER handler, the
walker previously left the inner-try entries between the throw site
and the matched outer entry on the eh-stack. The matched entry got
its `EH_TRY_CATCH_STATE_BIT` set, but `frame->eh_count` stayed
unchanged. After the outer catch body's END decremented eh_count by
one, the inner-try slot remained at the top of the eh-stack with
the matched outer entry now sitting *under* it (in-progress bit
set).

A subsequent throw inside (or after) the outer catch body would
walk that stale state. The walker SKIPs entries with the state bit
set, so the outer entry was correctly ignored — but the inner-try
entry (no state bit) was treated as live. If the inner try's typed
catch happened to match the new tag, the walker dispatched against
that stale entry — an out-of-scope catch.

Worse, in a tight loop of `outer try { inner try { throw }
catch_other catch_outer { ... } }`, every iteration leaked one
inner-try entry. After more iterations than the function's
`exception_handler_count`, the next TRY push wrote past the static
eh-stack reservation (silently in release builds since `bh_assert`
is a no-op without `BH_DEBUG`).

Fix: at each match-and-dispatch site in `find_a_catch_handler` —
both the typed-catch branch and the catch_all branch — set
`frame->eh_count = i;` before jumping to the handler. `i` is the
loop counter, which equals the index of the matched entry plus
one. This pops the nested-try entries above the match in a single
indexed store. The matched entry stays at index i-1 with its state
bit set; the catch body's END pops it normally when the body
completes.
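
The unwind can be modelled with a toy eh-stack. This is a sketch
under stated assumptions, not the code in `find_a_catch_handler`:
the entry layout, the fixed capacity of 8, and every `sketch_` name
are hypothetical. Only the top-down walk, the skip on the state
bit, and the `eh_count = i` store follow the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for EH_TRY_CATCH_STATE_BIT. */
#define SKETCH_STATE_BIT 0x80000000u
#define SKETCH_NO_MATCH (-1)
#define SKETCH_EH_CAPACITY 8

typedef struct {
    /* Low bits: the tag this entry's typed catch accepts.
     * Top bit: set while the entry's catch body is running. */
    uint32_t tag_and_state;
} sketch_eh_entry;

typedef struct {
    sketch_eh_entry eh_stack[SKETCH_EH_CAPACITY];
    uint32_t eh_count;
} sketch_frame;

/* TRY: push one entry. */
static void
sketch_push_try(sketch_frame *frame, uint32_t tag)
{
    assert(frame->eh_count < SKETCH_EH_CAPACITY);
    frame->eh_stack[frame->eh_count++].tag_and_state = tag;
}

/* THROW: walk from the top of the eh-stack. Returns the index of
 * the matched entry, or SKETCH_NO_MATCH if nothing in this frame
 * catches the tag. */
static int
sketch_find_catch(sketch_frame *frame, uint32_t thrown_tag)
{
    uint32_t i;

    for (i = frame->eh_count; i > 0; i--) {
        sketch_eh_entry *entry = &frame->eh_stack[i - 1];

        if (entry->tag_and_state & SKETCH_STATE_BIT)
            continue; /* catch body already in progress: skip */

        if (entry->tag_and_state == thrown_tag) {
            entry->tag_and_state |= SKETCH_STATE_BIT;
            /* The fix: i is the matched index plus one, so this
             * drops every nested-try entry above the match. */
            frame->eh_count = i;
            return (int)(i - 1);
        }
    }
    return SKETCH_NO_MATCH;
}
```

With an outer try for tag 1 and an inner try for tag 2, throwing
tag 1 matches index 0 and leaves `eh_count` at 1. A later throw of
tag 2 then finds no live entry, because the inner entry is gone and
the outer one has its state bit set.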

Cost shape: one extra indexed store on the cold throw path, only
when a typed catch or catch_all matches. CALL / LOAD / STORE
handlers are untouched.

Test added in the external integration suite at
`crates/benchmark-core/tests/eh_correctness.rs::
outer_catch_unwinds_inner_eh_entries`. The test pattern is: outer
try catches `$err`; inner try has a catch for `$err2`. Inner throw
of `$err` is caught by outer. Outer catch body re-throws `$err2`,
which must propagate UNCAUGHT (inner try is out of scope). Before
the fix, the walker found the stale inner catch and dispatched to
it, producing `Ok(99)` instead of the trap; after the fix, the
walker has no in-scope entries and the throw escapes correctly.

Codex P1 review feedback on
rebeckerspecialties/wasm-micro-runtime PR #2: "Unwind skipped EH
entries before dispatching catches".
The walker's "no handler in this frame" path previously set
`prev_frame->exception_raised = true` and let `return_func`
forward the throw to the caller, regardless of payload size.
This silently lost the payload: the source cells
(`throw_src_offsets`) live in *this* frame's `frame_lp`, which
return_func is about to tear down. The caller's
`find_a_catch_handler` then ran with `throw_param_cell_num = 0`,
which made any typed catch in the caller bind uninitialized
destination slots — the catch body would either see garbage in
its payload locals or, if the typed catch's slots were used as
struct-of-pointers, dereference freed memory.

Cross-function payload preservation would require a per-thread
scratch buffer to ferry the payload across the frame boundary
(callee's frame_lp → buffer → caller's frame_lp), plus a small
change to return_func to populate it before tearing down the
callee. That's a meaningful design lift and out of scope for
this commit.

Safe action for now: when a payload-bearing throw escapes its
callee (i.e. `throw_param_cell_num > 0` and we're about to
return to a caller frame), trap to the host with the diagnostic
`"cross-function exception payload not supported by fast-interp"`.
Same-function payload routing (the common Porffor / AS shape,
where a JS throw is caught by an in-function catch the JS-to-wasm
compiler emitted) is unaffected: that path dispatches via the
same-function match in the walker before this branch runs.

A `catch_all` in the caller would technically tolerate a
zero-payload bind, but the typed-vs-catch_all choice happens in
the caller's walker, which we can't peek into here without
coupling the frames. Trap unconditionally for payload-bearing
cross-frame throws.
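
The decision on the "no handler in this frame" path reduces to one
test on the payload size. The sketch below is illustrative only:
the enum, the function name, and the out-parameter are hypothetical,
and it assumes the caller has already established that no handler
matched in the current frame and that a caller frame exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    SKETCH_FORWARD_TO_CALLER, /* payload-free throw: caller's walker runs */
    SKETCH_TRAP_TO_HOST       /* payload-bearing throw: unsupported */
} sketch_escape_action;

static const char SKETCH_CROSS_FRAME_MSG[] =
    "cross-function exception payload not supported by fast-interp";

/* Precondition (assumed): no handler matched in this frame and we
 * are about to return to a caller frame. */
static sketch_escape_action
sketch_on_unhandled_throw(uint32_t throw_param_cell_num,
                          const char **trap_msg)
{
    assert(trap_msg != NULL);

    if (throw_param_cell_num > 0) {
        /* The payload cells live in the frame being torn down, so
         * forwarding would hand the caller uninitialized slots. */
        *trap_msg = SKETCH_CROSS_FRAME_MSG;
        return SKETCH_TRAP_TO_HOST;
    }

    *trap_msg = NULL;
    return SKETCH_FORWARD_TO_CALLER;
}
```

Payload-free throws keep the existing forward-to-caller behaviour;
only throws with `throw_param_cell_num > 0` take the trap branch.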

Tests:
* `cross_function_tag_with_params` stays `#[ignore]` — that's
  the eventual-success-case for when cross-frame payload routing
  is implemented.
* `cross_function_tag_with_params_traps` (new) asserts the
  current trap-with-expected-message contract on the same
  module shape.

Codex P1 review feedback on
rebeckerspecialties/wasm-benchmark PR #3 (patch 0007 line 306):
"Preserve cross-frame exception payloads".
…egion

When a br skips over a try-region's END, the runtime br doesn't pop
eh-stack entries. For a one-shot br to a block / function-end /
catch, the leaked entry is absorbed by the static
`exception_handler_count * EH_ENTRY_CELLS` reservation and dies at
frame teardown — a load-time `LOG_WARNING` surfaces the shape for
embedders.

If the br target is a LOOP entry, however, every iteration's TRY
push adds one more entry to the eh-stack. After more iterations
than the function's `exception_handler_count`, the next TRY push
writes past the static reservation. `bh_assert(eh_count < count)`
catches this in debug builds, but is a no-op without `BH_DEBUG` —
release builds silently corrupt whatever sat past the reservation
in the frame allocation.

This commit changes that pathological shape from "log a warning
and accept" to "fail load with an explicit error". The check sits
next to the existing `count_try_blocks_crossed > 0` warning at all
three branch sites (BR, BR_IF, BR_TABLE) and only fires when
`frame_csp_tmp->label_type == LABEL_TYPE_LOOP`. The error message
is identical at each site modulo opcode name:

  "br[_if|_table] to loop entry from inside try-region not
   supported in fast interpreter (would leak eh-stack entries
   per iteration)"
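
A sketch of what such a load-time check could look like. The label
enum, the function name, and the error-buffer convention are
assumptions made for illustration; the condition (at least one
try-region crossed and a LOOP target) and the message text follow
the description above. How the loader spells the opcode name in the
message is also assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical label kinds; only the LOOP case matters here. */
typedef enum {
    SKETCH_LABEL_BLOCK,
    SKETCH_LABEL_LOOP,
    SKETCH_LABEL_FUNCTION
} sketch_label_type;

/* Returns true if the branch is acceptable. On rejection, writes
 * the diagnostic into error_buf and returns false. */
static bool
sketch_check_br_target(uint32_t try_blocks_crossed,
                       sketch_label_type target_type,
                       const char *opcode_name,
                       char *error_buf, size_t error_buf_size)
{
    assert(opcode_name != NULL && error_buf != NULL);

    if (try_blocks_crossed > 0 && target_type == SKETCH_LABEL_LOOP) {
        /* Each loop iteration would push another eh-stack entry
         * that the branch never pops, so fail the load instead. */
        snprintf(error_buf, error_buf_size,
                 "%s to loop entry from inside try-region not "
                 "supported in fast interpreter (would leak "
                 "eh-stack entries per iteration)",
                 opcode_name);
        return false;
    }
    return true;
}
```

A branch that crosses a try-region but targets a block or the
function end still passes this check.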

Emitting a synthetic eh-stack pop at the br site would be the
other fix and would let valid modules with this shape run, but it
complicates the rewritten IR's br-info layout (the br dispatch
currently emits a single uint32 depth; a pop-count immediate
would need a per-target lookup) and the shape is rare in
practice. Rejecting at load is the conservative, App-Store-safe
choice — embedders see a deterministic error rather than silent
memory corruption.

Test added in the external integration suite: the previously
ignored `br_out_of_try_inside_loop` became
`br_out_of_try_inside_loop_rejected`, which asserts the loader
fails with the expected error string.

Codex P1 review feedback on both PRs ("Reject branches that leak
EH entries" / "Reject branches that leak EH stack entries").
Windows MSVC build of upstream PR bytecodealliance#4949 failed with
`LNK2019: unresolved external symbol __builtin_expect` because
`__builtin_expect` is a GCC/Clang builtin and MSVC has nothing
equivalent. The branch-predictor hints are an optimization, not
correctness, so the simplest portable fix is a no-op fallback
gated on `!defined(__GNUC__) && !defined(__clang__)`.

Lives at the top of `wasm_interp_fast.c` rather than in
`bh_platform.h` to avoid touching the shared header for a
local cold-path concern.
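
One plausible shape for such a fallback, shown as a sketch rather
than the exact lines added by the commit: the macro body and the
`sketch_checked_div` helper are assumptions, and only the
preprocessor gate is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

#if !defined(__GNUC__) && !defined(__clang__)
/* Assumed fallback: evaluate the condition and drop the hint. */
#define __builtin_expect(cond, expected) (cond)
#endif

/* Hypothetical caller: the hinted check builds on GCC and Clang
 * (real builtin) and on other compilers (no-op macro). */
static int
sketch_checked_div(int num, int den, int *out)
{
    assert(out != NULL);

    if (__builtin_expect(den == 0, 0))
        return -1; /* cold path: division by zero */

    *out = num / den;
    return 0;
}
```

Because the macro expands to the bare condition, control flow is
identical on every compiler; only the layout hint is lost.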
matthargett added a commit to rebeckerspecialties/wasm-benchmark that referenced this pull request May 21, 2026
Upstream PR bytecodealliance/wasm-micro-runtime#4949
failed every `build_iwasm` matrix entry on Windows MSVC with
`LNK2019: unresolved external symbol __builtin_expect referenced
in function wasm_interp_call_func_bytecode`. The cold-path hints
we added in patch 0011 use the GCC/Clang `__builtin_expect`
intrinsic; MSVC has no equivalent.

Drop-in no-op shim gated on `!defined(__GNUC__) && !defined(__clang__)`.
The hints are branch-predictor optimization, not correctness, so
dropping them on MSVC is fine. Same change is on the upstream PR
branch as commit `0411662d` (separate fixup commit; lands in the
PR sequence right after patch 0011's equivalent).

Stack-position rationale: patch 0024 (after linmem 0023) inserts
9 lines near the top of `wasm_interp_fast.c` between the SIMDe
include guards and `typedef int32 CellType_I32`. Putting it last
in the apply-stack avoids shifting line-number anchors for any
of the earlier patches.
@matthargett
Author

Update: pushed 0411662d — MSVC __builtin_expect no-op shim. The Windows MSVC matrix that was failing with LNK2019: unresolved external symbol __builtin_expect is now green.

In the single remaining CI failure (build_regression_tests (ubuntu-22.04)), the four failing tests (BA issues 2702, 2833, 270801, 270802) all execute under mode: aot / runtime: iwasm-default and crash with exit code -4 (SIGILL) on the AOT binary. This PR doesn't touch AOT codegen or the AOT runtime, only fast-interp. The same SIGILL-on-AOT pattern is visible on main's recent nightly_run (test (ubuntu-22.04, asan, aot, $WASI_TEST_OPTIONS) plus the tsan variant; both fail, and both run AOT-compiled wasm). My read is that this is an upstream-wide CI-infrastructure issue introduced around the LLVM 22 bump (PR #4937); happy to be told otherwise if I'm misreading. Either way, it is nothing this PR can fix.
